{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HTML解析简介\n",
    "\n",
    "*  本周主要内容：HTML解析（parse HTML）及Xpath实践\n",
    "*  20春_Web数据挖掘_week02\n",
    "*  电子讲义原设计者：廖汉腾, 许智超\n",
    "*  电子讲义练习改写者：XXX\n",
    "  * 本周电子讲义互评工作坊，依序做以下动作：\n",
    "     * 在e.nfu.edu.cn下载此文档\n",
    "     * 在自己本地端实操，把ipynb文档中的123456789改名为学号\n",
    "     * <mark>在还没有改动之前，先把此后缀为学号ipynb文档上传至Github为第一版</mark> (其它文档不计)\n",
    "     * 在自己本地端实操，练习所有内容，按需增减本讲义内容，含代码丶markdown丶新数据(含其连结)及\n",
    "     * 及格至少要做: <mark>**\"本周小结内容\" 以markdown语法，按上课及本电子讲义补充内容进行150-500字的摘要说明**</mark>，可利用HTML文内超连结连到同文档其他的笔记内容\n",
    "     * 互评时会要求提交自己文档的改动比较，以方便同学观看你的改动范围及内容\n",
    "  * 本周加分项，以抢快为主，<mark>1人最多只能抢1项</mark>，需以指定的url进行数据挖掘并输出excel，抢快时间<mark>首先</mark>以该代码在Github的提交时间为准，若两人Github提交时间相差不到3分钟，则以<mark>再以</mark>QQ群@老师时间为判断\n",
    "      * C-3 期末总分加1\n",
    "      * C-4 期末总分加2\n",
    "      * C-5 期末总分加5\n",
    "-----\n",
    "![for humans](https://requests-html.kennethreitz.org/_static/requests-html-logo.png)\n",
    "\n",
    "## 复习\n",
    "\n",
    "复习：上周内容，总观使用\n",
    "\n",
    "* requests-html  丶\n",
    "* pd.read_html 丶及\n",
    "* requests + lxml \n",
    "\n",
    "的Web数据挖掘内容，最主要包括以下前后的主要数据挖掘内容\n",
    "\n",
    "1. 使用 HTTP 发送请求（HTTP request）\n",
    "2. 判断 HTTP 及状态（HTTP status code） 及 HTTP 响应（HTTP response）是否正常\n",
    "3. 执行 HTML 解析（parse HTML），通常使用 xpath or CSS selector 选择器\n",
    "\n",
    "<br/>\n",
    "<br/>\n",
    "\n",
    "-----\n",
    "![Xpath Axis](http://krum.rz.uni-mannheim.de/inet-2005/images/xpath-axis.gif)\n",
    "\n",
    "\n",
    "## 本周内容及学习目标\n",
    "\n",
    "本周内容聚焦在第3.部分\n",
    "挑选比较容易Web数据挖掘的网页（i.e. 比较没有以上1. 及2. 的坑），学习解决以下挑战：\n",
    "\n",
    "1. 使用 requests-html 爬取并存取网页文字档，查找[requests-html 中文文档](https://cncert.github.io/requests-html-doc-cn/#/)\n",
    "2. 熟悉 [xpath 语法](https://www.w3cschool.cn/xpath/xpath-syntax.html)丶[xpath 节点](https://www.w3cschool.cn/xpath/xpath-nodes.html)\n",
    "3. 使用 [xpath cheatsheet](https://devhints.io/xpath)\n",
    "  * 在 Chrome Inspector 使用\n",
    "  * 在 requests-html (Python) 使用\n",
    "4. 简易使用 [pd.DataFrame]()\n",
    "\n",
    "学生将实践\n",
    "* 解析简单HTML页面\n",
    "* 使用xpath（不挑greedy vs. 及挑剔ungreedy的策略）\n",
    "* 获取标签tags丶属性attributes丶值values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "/* 本电子讲义使用之CSS */\n",
       "div.code_cell {\n",
       "    background-color: #e5f1fe;\n",
       "}\n",
       "div.cell.selected {\n",
       "    background-color: #effee2;\n",
       "    font-size: 2rem;\n",
       "    line-height: 2.4rem;\n",
       "}\n",
       "div.cell.selected .rendered_html table {\n",
       "    font-size: 2rem !important;\n",
       "    line-height: 2.4rem !important;\n",
       "}\n",
       ".rendered_html pre code {\n",
       "    background-color: #C4E4ff;   \n",
       "    padding: 2px 25px;\n",
       "}\n",
       ".rendered_html pre {\n",
       "    background-color: #99c9ff;\n",
       "}\n",
       "div.code_cell .CodeMirror {\n",
       "    font-size: 2rem !important;\n",
       "    line-height: 2.4rem !important;\n",
       "}\n",
       ".rendered_html img, .rendered_html svg {\n",
       "    max-width: 35%;\n",
       "    height: auto;\n",
       "    float: right;\n",
       "}\n",
       "/* Gradient transparent - color - transparent */\n",
       "hr {\n",
       "    border: 0;\n",
       "    border-bottom: 1px dashed #ccc;\n",
       "}\n",
       ".emoticon{\n",
       "    font-size: 5rem;\n",
       "    line-height: 4.4rem;\n",
       "    text-align: center;\n",
       "    vertical-align: middle;\n",
       "}\n",
       "</style>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<style>\n",
    "/* 本电子讲义使用之CSS */\n",
    "div.code_cell {\n",
    "    background-color: #e5f1fe;\n",
    "}\n",
    "div.cell.selected {\n",
    "    background-color: #effee2;\n",
    "    font-size: 2rem;\n",
    "    line-height: 2.4rem;\n",
    "}\n",
    "div.cell.selected .rendered_html table {\n",
    "    font-size: 2rem !important;\n",
    "    line-height: 2.4rem !important;\n",
    "}\n",
    ".rendered_html pre code {\n",
    "    background-color: #C4E4ff;   \n",
    "    padding: 2px 25px;\n",
    "}\n",
    ".rendered_html pre {\n",
    "    background-color: #99c9ff;\n",
    "}\n",
    "div.code_cell .CodeMirror {\n",
    "    font-size: 2rem !important;\n",
    "    line-height: 2.4rem !important;\n",
    "}\n",
    ".rendered_html img, .rendered_html svg {\n",
    "    max-width: 35%;\n",
    "    height: auto;\n",
    "    float: right;\n",
    "}\n",
    "/* Gradient transparent - color - transparent */\n",
    "hr {\n",
    "    border: 0;\n",
    "    border-bottom: 1px dashed #ccc;\n",
    "}\n",
    ".emoticon{\n",
    "    font-size: 5rem;\n",
    "    line-height: 4.4rem;\n",
    "    text-align: center;\n",
    "    vertical-align: middle;\n",
    "}\n",
    "</style>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 基本模块\n",
    "import pandas as pd\n",
    "from requests_html import HTMLSession"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# requsts-html\n",
    "学生将实践\n",
    "* 解析简单HTML页面\n",
    "\n",
    "使用 requests-html 爬取并存取网页文字档，查找[requests-html 中文文档](https://cncert.github.io/requests-html-doc-cn/#/)\n",
    "\n",
    "* API 文档\n",
    "  * HTML类\n",
    "  * Element类\n",
    "  * HTML Sessions (应正名为HTTP Sessions)  \n",
    "* [原文档](https://requests-html.kennethreitz.org//_modules/requests_html.html)\n",
    "\n",
    "要点：HTTP 和 HTML 的分工与合作"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HTML类\n",
    "\n",
    "HTML文本的基本使用及保存备用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A1  nfu.edu.cn 搜 文学与传媒学院 保存备用\n",
    "payload = {\n",
    "    \"keyword\":\"文学与传媒学院\",\n",
    "    \"p\":\"1\"\n",
    "}\n",
    "session = HTMLSession()\n",
    "r = session.get(\"http://www.nfu.edu.cn/index.php/home/article/search.html\", params=payload)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\ufeff<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\\r\\n<html>\\r\\n<head>\\r\\n<meta name=\"renderer\" content=\"webkit\">\\r\\n<meta http-equiv=\"x-ua-compatible\" content=\"IE=edge\" >\\r\\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\\r\\n<title>-中山大学南方学院 </title>\\r\\n<meta name=\"keywords\" content=\"\">\\r\\n<meta name=\"description\" content=\"\">\\r\\n\\t\\n<link rel=\"stylesheet\" type=\"text/css\" href=\"/Public/Home/css/swiper-3.3.1.min.css\"/>\\n\\t\\t<link href=\"/Public/Home/css/lin.css\" rel=\"stylesheet\" type=\"text/css\" />\\n\\n\\t\\t<script src=\"/Public/Home/js/jquery-1.11.3.min.js\"></script>\\n\\t\\t<script src=\"/Public/Home/js/jquery-1.11.1.js\"></script>\\n\\t\\t<script src=\"/Public/Home/js/jquery.easie-min.js\" type=\"text/javascript\"></script>\\n\\t\\t<script src=\"/Public/Home/js/swiper.min.js\" type=\"text/javascript\"></script>\\n\\t\\t<script src=\"/Public/Home/js/lin.js\"></script>\\n\\t\\t\\n\\t\\t\\n<link href=\"/Public/Home/page.css\" rel=\"stylesheet\" type=\"text/css\" />\\n<link href=\"/Public/favicon.ico\" rel=\"Shortcut Icon\">\\n<link href=\"/Public/favicon.ico\" rel=\"Bookmark\">\\n\\r\\n\\t</head>\\r\\n<body>\\r\\n\\ufeff<!--头部-->\\n\\t\\t<div class=\"lin-header \">\\n\\t\\t\\t<div class=\"lin-head clearfix\">\\n\\t\\t\\t\\t<h1 class=\"lin-topl\"><a href=\"/index.php\" target=\"_blank\" title=\"中山大学南方学院\"><img src=\"/Public/Home/images/logo.png\"/></a></h1>\\n\\t\\t\\t\\t<div class=\"lin-topr\">\\n\\t\\t\\t\\t\\t<div class=\"lin-youxiang\">\\n\\t\\t\\t\\t\\t\\t<a href=\"http://oa.nfu.edu.cn/\" target=\"_blank\">办公系统</a>\\n\\t\\t\\t\\t\\t\\t<a href=\"http://en.nfu.edu.cn/\">English Version</a>\\n\\t\\t\\t\\t\\t\\t<!-- <a href=\"https://mail.nfu.edu.cn/\" target=\"_blank\">邮箱登录</a>\\n\\t\\t\\t\\t\\t\\t<a href=\"mailto:nfcsysuyz@126.com\" target=\"_blank\" title=\"nfcsysuyz@126.com\" >院长信箱</a> -->\\n\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t<div class=\"lin-ser lin-serhide\">\\n\\t\\t\\t\\t\\t\\t<div class=\"serbox\">\\n\\t\\t\\t\\t\\t\\t<form action=\"/index.php/home/article/search.html\" method=\"get\" id=\"search_form\">\\n\\t\\t\\t\\t\\t\\t\\t<input type=\"text\" name=\"keyword\" id=\"keyword\" placeholder=\"搜索\" />\\n\\t\\t\\t\\t\\t\\t\\t<a href=\"javascript:;\" id=\"search_btn\" ></a>\\n\\t\\t\\t\\t\\t\\t</form>\\t\\n\\t\\t\\t\\t\\t\\t<script type=\"text/javascript\">\\n\\t\\t\\t\\t\\t\\t\\t$(\"#search_btn\").click(function(){\\n\\t\\t\\t\\t\\t\\t\\t\\tvar keyword=$(\"#keyword\").val();\\n\\t\\t\\t\\t\\t\\t\\t\\tif(keyword==\\'\\'){\\n\\t\\t\\t\\t\\t\\t\\t\\t\\talert(\\'* 请输入搜索关键词 !\\');\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t$(\"#keyword\").focus();\\n\\t\\t\\t\\t\\t\\t\\t\\t\\treturn false;\\n\\t\\t\\t\\t\\t\\t\\t\\t}else{\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t$(\"#search_form\").submit();\\n\\t\\t\\t\\t\\t\\t\\t\\t}\\n\\t\\t\\t\\t\\t\\t\\t})\\n\\t\\t\\t\\t\\t\\t</script>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t\\t<!-- <span class=\"ser-biaoti\"><a href=\\'\\' style=\"color:#fff;\">English Version</a></span> -->\\n\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t</div>\\n\\t\\t\\t</div>\\n\\t\\t</div>\\n\\t\\t<!-- end 头部-->\\n\\t\\t<!--导航条-->\\n\\t\\t<div class=\"lin-navbar\">\\n\\t\\t\\t<p class=\"navnav\">\\n\\t\\t\\t\\t<span></span>\\n\\t\\t\\t\\t<span></span>\\n\\t\\t\\t\\t<span></span>\\n\\t\\t\\t</p>\\n\\t\\t\\t<ul class=\"lin-nav clearfix\">\\n\\t\\t\\t\\t<li  class=\"lin-navli\"><a href=\"/index.php\">首页</a>\\n\\t\\t\\t\\t</li>\\n\\t\\t\\t\\t<li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/29.html\"  target=\"_blank\">学校概况</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/29.html\" target=\"_self\">学校简介</a></li><li><a href=\"/index.php/home/article/index/cid/30.html\" target=\"_blank\">现任领导</a></li><li><a href=\"/index.php/home/article/index/cid/135.html\" target=\"_self\">校徽  校训  校歌</a></li><li><a href=\"/index.php/home/article/index/cid/34.html\" target=\"_blank\">南方大事记</a></li><li><a href=\"/index.php/home/article/index/cid/104.html\" target=\"_self\">学校校历</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/2.html\"  target=\"_self\">党建之窗</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/61.html\"  target=\"_blank\">机构设置</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/61.html\" target=\"_self\">院系设置</a></li><li><a href=\"/index.php/home/article/index/cid/36.html\" target=\"_self\">管理机构</a></li><li><a href=\"/index.php/home/article/index/cid/165.html\" target=\"_self\">常设委员会</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/31.html\"  target=\"_blank\">人才培养</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/31.html\" target=\"_blank\">名师介绍</a></li><li><a href=\"/index.php/home/article/index/cid/163.html\" target=\"_self\">本科教育</a></li><li><a href=\"/index.php/home/article/index/cid/164.html\" target=\"_self\">继续教育</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/106.html\"  target=\"_blank\">教学科研</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/106.html\" target=\"_blank\">教务与科研部</a></li><li><a href=\"/index.php/home/article/index/cid/127.html\" target=\"_blank\">科研信息与动态</a></li><li><a href=\"/index.php/home/article/index/cid/107.html\" target=\"_blank\">科研机构</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/49.html\"  target=\"_blank\">招生就业</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/49.html\" target=\"_blank\">本科招生</a></li><li><a href=\"/index.php/home/article/index/cid/129.html\" target=\"_self\">继续教育</a></li><li><a href=\"/index.php/home/article/index/cid/50.html\" target=\"_blank\">就业服务</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/79.html\"  target=\"_blank\">图书馆</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/79.html\" target=\"_blank\">图书馆</a></li><li><a href=\"/index.php/home/article/index/cid/80.html\" target=\"_blank\">档案室</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/159.html\"  target=\"_blank\">合作交流</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/159.html\" target=\"_blank\">国际交流</a></li><li><a href=\"/index.php/home/article/index/cid/161.html\" target=\"_blank\">外事服务</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/44.html\"  target=\"_blank\">人才招聘</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/44.html\" target=\"_blank\">教师系列</a></li><li><a href=\"/index.php/home/article/index/cid/45.html\" target=\"_blank\">管理系列</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/32.html\"  target=\"_blank\">走进南方</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/32.html\" target=\"_blank\">图说南方</a></li><li><a href=\"/index.php/home/article/index/cid/105.html\" target=\"_self\">生活服务</a></li><li><a href=\"/index.php/home/article/index/cid/87.html\" target=\"_self\">医疗服务</a></li><li><a href=\"/index.php/home/article/index/cid/51.html\" target=\"_blank\">校报</a></li><li><a href=\"/index.php/home/article/index/cid/82.html\" target=\"_self\">交通指引</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li>\\n\\t\\t\\t</ul>\\n\\t\\t\\t\\n\\t\\t</div>\\n\\t\\t<div class=\"lin-navbg\"></div>\\n\\n\\r\\n<div class=\"lin-content\">\\r\\n\\t\\t\\t<div class=\"lin-neiye clearfix\">\\r\\n\\t\\t\\t\\t\\r\\n\\r\\n\\t\\t\\t    <div class=\"search_list_right\">\\r\\n\\t\\t\\t        <div class=\"fan clearfix\">\\r\\n\\t\\t\\t            <span class=\"fan_title\">站内搜索</span>\\r\\n\\t\\t\\t            <span class=\"fan_right\">您当前位置是：<a href=\"/index.php\">网站首页</a> &gt; <font>站内搜索</font></span>\\r\\n\\t\\t\\t        </div>\\r\\n\\t\\t\\t        <div class=\"ny_content\">\\r\\n\\t\\t\\t\\t\\t\\t<ul class=\"list-ul\">\\r\\n\\t\\t\\t\\t\\t\\t<li><font class=\"right-more\">2020-01-06</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/6363.html\" target=\"_blank\" title=\"文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会\"><font color=red>文学与传媒学院</font>教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2020-01-06</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/6366.html\" target=\"_blank\" title=\"文学与传媒学院2019年学术研讨会暨总结大会顺利召开\"><font color=red>文学与传媒学院</font>2019年学术研讨会暨总结大会顺利召开</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-12-20</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/6318.html\" target=\"_blank\" title=\"展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕\">展现当代青年的迷惘与奋进——我校<font color=red>文学与传媒学院</font>大型原创舞台剧《春至》圆满落幕</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-11-22</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/6154.html\" target=\"_blank\" title=\"文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束\"><font color=red>文学与传媒学院</font>考研座谈暨2020年考研交流答疑会圆满结束</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-11-05</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/5348.html\" target=\"_blank\" title=\"文学与传媒学院教师招聘启事\"><font color=red>文学与传媒学院</font>教师招聘启事</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-11-04</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/6016.html\" target=\"_blank\" title=\"创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行\">创意无限，未来可期——<font color=red>文学与传媒学院</font>青马工程第四讲暨闭营仪式顺利举行</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-11-04</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/6019.html\" target=\"_blank\" title=\"垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行\">垃圾分类我先行——<font color=red>文学与传媒学院</font>“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-09-16</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/5794.html\" target=\"_blank\" title=\"以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束\">以梦为马，不负韶华——<font color=red>文学与传媒学院</font>2019级新生开学典礼圆满结束</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-09-09</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/5776.html\" target=\"_blank\" title=\"文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖\"><font color=red>文学与传媒学院</font>学子在全国高校数字艺术设计大赛中斩获大奖</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-09-09</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/5777.html\" target=\"_blank\" title=\"文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩\"><font color=red>文学与传媒学院</font>学子在第七届中国大学生公共关系策划大赛中喜获佳绩</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-06-24</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/5642.html\" target=\"_blank\" title=\"倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕\">倾心之作，致敬经典——<font color=red>文学与传媒学院</font>紫阳戏剧社《倾城之恋》话剧展演圆满落幕</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-06-24</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/5647.html\" target=\"_blank\" title=\"毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展\">毕业季 | 今朝有离别，青春不散场 ——<font color=red>文学与传媒学院</font>2019届毕业生毕业季系列活动有序开展</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li>\\t\\t\\t\\t\\t\\t</ul>\\r\\n\\t\\t\\t\\t\\t\\t<div style=\"clear: both;\"></div>\\r\\n\\t\\t\\t\\t\\t\\t<div class=\"pages\" align=\"center\"><div>  <span class=\"current\">1</span><a class=\"num\" href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/2.html\">2</a><a class=\"num\" href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/3.html\">3</a><a class=\"num\" href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/4.html\">4</a><a class=\"num\" href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/5.html\">5</a><a class=\"num\" href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/6.html\">6</a><a class=\"num\" href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/7.html\">7</a> <a class=\"next\" href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/2.html\">>></a> </div></div>\\r\\n\\t\\t\\t        </div>\\r\\n\\t\\t\\t    </div>\\r\\n\\t\\t\\t</div>\\r\\n\\t\\t</div>\\r\\n\\t\\t<!-- end 内容区域-->\\r\\n\\r\\n\\t<!--底部-->\\n\\t\\t<div class=\"lin-footer\">\\n\\t\\t\\t<div class=\"lin-fer clearfix\">\\n\\t\\t\\t\\t<div class=\"ferleft\">\\n\\t\\t\\t\\t\\t<ul class=\"fer-ul clearfix\">\\n\\t\\t\\t\\t\\t\\t<li class=\"fer-li\"><a href=\"http://www.moe.gov.cn/\" target=\"_blank\" title=\"教育部\">教育部</a></li><li class=\"fer-li\"><a href=\"http://www.gz.gov.cn/\" target=\"_blank\" title=\"广州市政府\">广州市政府</a></li><li class=\"fer-li\"><a href=\"http://www.cnki.net/\" target=\"_blank\" title=\"中国知网\">中国知网</a></li><li class=\"fer-li\"><a href=\"http://edu.gd.gov.cn\" target=\"_blank\" title=\"广东省教育厅\">广东省教育厅</a></li><li class=\"fer-li\"><a href=\"http://www.gdpr.com/\" target=\"_blank\" title=\"珠江投资\">珠江投资</a></li><li class=\"fer-li\"><a href=\"http://journal.nfu.edu.cn/CN/volumn/home.shtml\" target=\"_blank\" title=\"南方论丛\">南方论丛</a></li><li class=\"fer-li\"><a href=\"http://www.sysu.edu.cn/\" target=\"_blank\" title=\"中山大学 \">中山大学 </a></li><li class=\"fer-li\"><a href=\"http://www.nfu.edu.cn/index.php/home/article/index/cid/136.html\" target=\"_blank\" title=\"珠江教育联盟\">珠江教育联盟</a></li>\\t\\n\\t\\t\\t\\t\\t\\t<li class=\"fer-li\"><a href=\"/index.php/home/article/link.html\">更多>></a></li>\\n\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t<div class=\"fercen\">\\n\\t\\t\\t\\t\\t<div class=\"fer-er\"><img src=\"/Public/Home/images/erweima1.jpg\"/></div>\\n\\t\\t\\t\\t\\t<div class=\"fer-er\"><img src=\"/Public/Home/images/erweima2.jpg\"/></div>\\n\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t<div class=\"ferright\">\\n\\t\\t\\t\\t\\t<div><p><span>地址：广州市从化区温泉大道882号中山大学南方学院</span><span>邮编：510970</span></p></div>\\n\\t\\t\\t\\t\\t<div class=\"addleft\">\\n\\t\\t\\t\\t\\t\\t<p>联系电话：020-61787326</p>\\n\\t\\t\\t\\t\\t\\t<p>版权所有 ©  中山大学南方学院</p>\\n\\t\\t\\t\\t\\t\\t<p>技术支持：<a href=\"http://www.unsun.net\" target=\"_blank\">碧辉腾乐(UNSUN.NET)</a></p>\\n\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t<div class=\"addright\">\\n\\t\\t\\t\\t\\t\\t<p>招生咨询：020-87912619</p> \\n\\t\\t\\t\\t\\t\\t<p>\\n\\t\\t\\t\\t\\t\\t\\t<span class=\"add-spante\"><a target=\"_blank\" href=\"http://www.beian.miit.gov.cn\">粤ICP备11077779号</a></span> \\n\\t\\t\\t\\t\\t\\t\\t<span class=\"add-spante\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<a href=\"/index.php/admin/index/login.html\" target=\"_blank\" >网站管理</a>&nbsp;&nbsp;\\n\\t\\t\\t\\t\\t\\t\\t\\t<a href=\"http://old.nfu.edu.cn/\" target=\"_blank\" >旧站入口</a>\\n\\t\\t\\t\\t\\t\\t\\t</span>\\n\\t\\t\\t\\t\\t\\t</p>\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\n\\t\\t\\t\\t<div align=\"center\">\\n\\t\\t\\t\\t\\t\\t<a target=\"_blank\" href=\"http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=44011702000081\" style=\"display:inline-block;text-decoration:none;height:20px;line-height:20px;\"><img src=\"/Public/Home/images/icp.png\" style=\"float:left;\"/><p style=\"float:left;height:20px;line-height:20px;margin: 0px 0px 0px 5px; color:#ffffff;\">粤公网安备 44011702000081号</p></a>\\n\\t\\t\\t\\t</div>\\n\\t\\t\\t</div>\\n\\t\\t</div>\\n\\t\\t<!-- end 底部-->\\n\\n <script type=\"text/javascript\" language=\"javascript\">\\n \\n    //加入收藏\\n \\n        function AddFavorite(sURL, sTitle) {\\n \\n            sURL = encodeURI(sURL); \\n        try{   \\n \\n            window.external.addFavorite(sURL, sTitle);   \\n \\n        }catch(e) {   \\n \\n            try{   \\n \\n                window.sidebar.addPanel(sTitle, sURL, \"\");   \\n \\n            }catch (e) {   \\n \\n                alert(\"加入收藏失败，请使用Ctrl+D进行添加,或手动在浏览器里进行设置.\");\\n \\n            }   \\n \\n        }\\n \\n    }\\n \\n    //设为首页\\n \\n    function SetHome(url){\\n \\n        if (document.all) {\\n \\n            document.body.style.behavior=\\'url(#default#homepage)\\';\\n \\n               document.body.setHomePage(url);\\n \\n        }else{\\n \\n            alert(\"您好,您的浏览器不支持自动设置页面为首页功能,请您手动在浏览器里设置该页面为首页!\");\\n \\n        }\\n \\n    }\\n \\n</script>\\n\\r\\n\\r\\n</body>\\r\\n</html>'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#  A2  nfu.edu.cn 搜 文学与传媒学院 r.html\n",
    "# r.html  (HTML 元素/标签) \n",
    "\n",
    "r.html.html  \n",
    "# 可存网页为文字档"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A3  nfu.edu.cn 搜 文学与传媒学院 保存备用\n",
    "\n",
    "with open (\"20春_Web数据挖掘_week02_nfu_文学与传媒学院.html\", encoding = \"utf8\", mode = \"w\") as fp:\n",
    "    fp.write(r.html.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A4  复习 读保存备用的任何文字档\n",
    "\n",
    "with open (\"20春_Web数据挖掘_week02_nfu_文学与传媒学院.html\", encoding = \"utf8\", mode = \"r\") as fp:\n",
    "    html_load = fp.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 解析文本\n",
    "![HTML head body](https://rlv.zcache.com/head_body_t_shirt-rd222a2cce3704f3b87fae4ee0fb73744_k2gm8_307.jpg)\n",
    "HTML文本的解析\n",
    "\n",
    "```python\n",
    "\n",
    "parsed = requests_html.soup_parse(html_load)\n",
    "```\n",
    "\n",
    "```python\n",
    "\n",
    "import requests_html\n",
    "# parsed = requests_html.soup_parse(html_load)\n",
    "from requests_html import soup_parse\n",
    "# parsed = soup_parse(html_load)\n",
    "```\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A5  前方高能 HTML文本的解析\n",
    "\n",
    "import requests_html\n",
    "parsed = requests_html.soup_parse(html_load)\n",
    "解析后 = requests_html.soup_parse(html_load)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Element html at 0x19d2b3ff9f8>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "解析后   # <html> 元素标签"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element body at 0x19d2b4b4048>]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "解析后.xpath('body')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element head at 0x19d2b4b4138>]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "解析后.xpath('head')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element a at 0x19d2b4b42c8>,\n",
       " <Element a at 0x19d2b4b4368>,\n",
       " <Element a at 0x19d2b4b4318>,\n",
       " <Element a at 0x19d2b4b43b8>,\n",
       " <Element a at 0x19d2b4b4408>,\n",
       " <Element a at 0x19d2b4b4458>,\n",
       " <Element a at 0x19d2b4b44a8>,\n",
       " <Element a at 0x19d2b4b44f8>,\n",
       " <Element a at 0x19d2b4b4548>,\n",
       " <Element a at 0x19d2b4b4598>,\n",
       " <Element a at 0x19d2b4b45e8>,\n",
       " <Element a at 0x19d2b4b4638>,\n",
       " <Element a at 0x19d2b4b4688>,\n",
       " <Element a at 0x19d2b4b46d8>,\n",
       " <Element a at 0x19d2b4b4728>,\n",
       " <Element a at 0x19d2b4b4778>,\n",
       " <Element a at 0x19d2b4b47c8>,\n",
       " <Element a at 0x19d2b4b4818>,\n",
       " <Element a at 0x19d2b4b4868>,\n",
       " <Element a at 0x19d2b4b48b8>,\n",
       " <Element a at 0x19d2b4b4908>,\n",
       " <Element a at 0x19d2b4b4958>,\n",
       " <Element a at 0x19d2b4b49a8>,\n",
       " <Element a at 0x19d2b4b49f8>,\n",
       " <Element a at 0x19d2b4b4a48>,\n",
       " <Element a at 0x19d2b4b4a98>,\n",
       " <Element a at 0x19d2b4b4ae8>,\n",
       " <Element a at 0x19d2b4b4b38>,\n",
       " <Element a at 0x19d2b4b4b88>,\n",
       " <Element a at 0x19d2b4b4bd8>,\n",
       " <Element a at 0x19d2b4b4c28>,\n",
       " <Element a at 0x19d2b4b4c78>,\n",
       " <Element a at 0x19d2b4b4cc8>,\n",
       " <Element a at 0x19d2b4b4d18>,\n",
       " <Element a at 0x19d2b4b4d68>,\n",
       " <Element a at 0x19d2b4b4db8>,\n",
       " <Element a at 0x19d2b4b4e08>,\n",
       " <Element a at 0x19d2b4b4e58>,\n",
       " <Element a at 0x19d2b4b4ea8>,\n",
       " <Element a at 0x19d2b4b4ef8>,\n",
       " <Element a at 0x19d2b4b4f48>,\n",
       " <Element a at 0x19d2b4b4f98>,\n",
       " <Element a at 0x19d2b4b7048>,\n",
       " <Element a at 0x19d2b4b7098>,\n",
       " <Element a at 0x19d2b4b70e8>,\n",
       " <Element a at 0x19d2b4b7138>,\n",
       " <Element a at 0x19d2b4b7188>,\n",
       " <Element a at 0x19d2b4b71d8>,\n",
       " <Element a at 0x19d2b4b7228>,\n",
       " <Element a at 0x19d2b4b7278>,\n",
       " <Element a at 0x19d2b4b72c8>,\n",
       " <Element a at 0x19d2b4b7318>,\n",
       " <Element a at 0x19d2b4b7368>,\n",
       " <Element a at 0x19d2b4b73b8>,\n",
       " <Element a at 0x19d2b4b7408>,\n",
       " <Element a at 0x19d2b4b7458>,\n",
       " <Element a at 0x19d2b4b74a8>,\n",
       " <Element a at 0x19d2b4b74f8>,\n",
       " <Element a at 0x19d2b4b7548>,\n",
       " <Element a at 0x19d2b4b7598>,\n",
       " <Element a at 0x19d2b4b75e8>,\n",
       " <Element a at 0x19d2b4b7638>,\n",
       " <Element a at 0x19d2b4b7688>,\n",
       " <Element a at 0x19d2b4b76d8>,\n",
       " <Element a at 0x19d2b4b7728>,\n",
       " <Element a at 0x19d2b4b7778>,\n",
       " <Element a at 0x19d2b4b77c8>,\n",
       " <Element a at 0x19d2b4b7818>,\n",
       " <Element a at 0x19d2b4b7868>,\n",
       " <Element a at 0x19d2b4b78b8>,\n",
       " <Element a at 0x19d2b4b7908>,\n",
       " <Element a at 0x19d2b4b7958>,\n",
       " <Element a at 0x19d2b4b79a8>,\n",
       " <Element a at 0x19d2b4b79f8>,\n",
       " <Element a at 0x19d2b4b7a48>,\n",
       " <Element a at 0x19d2b4b7a98>,\n",
       " <Element a at 0x19d2b4b7ae8>]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "解析后.xpath('//a')  # greedy 所有<html> 元素标签"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会',\n",
       " '文学与传媒学院2019年学术研讨会暨总结大会顺利召开',\n",
       " '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕',\n",
       " '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束',\n",
       " '文学与传媒学院教师招聘启事',\n",
       " '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行',\n",
       " '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行',\n",
       " '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束',\n",
       " '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖',\n",
       " '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩',\n",
       " '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕',\n",
       " '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "解析后.xpath('//*[@class=\"news_title\"]//a/@title')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 直接使用\n",
    "\n",
    "```python\n",
    "r.html.xpath()\n",
    "```\n",
    "\n",
    "你来试试?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会', '文学与传媒学院2019年学术研讨会暨总结大会顺利召开', '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕', '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束', '文学与传媒学院教师招聘启事', '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行', '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行', '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束', '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖', '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩', '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕', '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展']\n",
      "['/index.php/home/article/search_detail/id/6363.html', '/index.php/home/article/search_detail/id/6366.html', '/index.php/home/article/search_detail/id/6318.html', '/index.php/home/article/search_detail/id/6154.html', '/index.php/home/article/search_detail/id/5348.html', '/index.php/home/article/search_detail/id/6016.html', '/index.php/home/article/search_detail/id/6019.html', '/index.php/home/article/search_detail/id/5794.html', '/index.php/home/article/search_detail/id/5776.html', '/index.php/home/article/search_detail/id/5777.html', '/index.php/home/article/search_detail/id/5642.html', '/index.php/home/article/search_detail/id/5647.html']\n",
      "['2020-01-06', '2020-01-06', '2019-12-20', '2019-11-22', '2019-11-05', '2019-11-04', '2019-11-04', '2019-09-16', '2019-09-09', '2019-09-09', '2019-06-24', '2019-06-24']\n"
     ]
    }
   ],
   "source": [
    "# A6  直接使用 requests-html\n",
    "payload = {\n",
    "    \"keyword\":\"文学与传媒学院\",\n",
    "    \"p\":\"1\"\n",
    "}\n",
    "\n",
    "session = HTMLSession()\n",
    "r = session.get(\"http://www.nfu.edu.cn/index.php/home/article/search.html\", params=payload)\n",
    "\n",
    "# 保存备用 (好习惯, 最好存一个地方)\n",
    "with open (\"20春_Web数据挖掘_week02_nfu_文学与传媒学院.html\", encoding = \"utf8\", mode = \"w\") as fp:\n",
    "    fp.write(r.html.html)\n",
    "\n",
    "# 解析文本直接使用\n",
    "print (r.html.xpath('//*[@class=\"news_title\"]/a/@title'))\n",
    "print (r.html.xpath('//*[@class=\"news_title\"]/a/@href'))\n",
    "print (r.html.xpath('//font[@class=\"right-more\"]/text()'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# xpath\n",
    "\n",
    "学生将实践\n",
    "\n",
    "* r.html 如 庖丁解牛\n",
    "* r.html.xapth() 挑牛肉吃\n",
    "  * **元素/标签如筋骨，值和文本通常才有牛肉**\n",
    "  * html以元素/标签构成\n",
    "  * 值和文本不单独存在，必需依附元素/标签\n",
    "* 使用xpath（不挑greedy vs. 及挑剔ungreedy的策略）\n",
    "\n",
    "* 获取标签tags丶属性attributes丶值values\n",
    "  * 掌握Chrome Inspector 多种颜色区分\n",
    "      * 元素/标签 elements/tags  ?色\n",
    "      * 属性attributes  ?色\n",
    "      * 值values ?色\n",
    "      * 文本 ?色\n",
    "      * HTML注解 ?色 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 取值及文本"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会',\n",
       " '文学与传媒学院2019年学术研讨会暨总结大会顺利召开',\n",
       " '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕',\n",
       " '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束',\n",
       " '文学与传媒学院教师招聘启事',\n",
       " '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行',\n",
       " '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行',\n",
       " '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束',\n",
       " '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖',\n",
       " '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩',\n",
       " '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕',\n",
       " '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-A-1 使用 取值 观察xpath最后的内容\n",
    "解析后.xpath('//div[@class=\"news_title\"]/a/@title')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会',\n",
       " '2019年学术研讨会暨总结大会顺利召开',\n",
       " '展现当代青年的迷惘与奋进——我校',\n",
       " '大型原创舞台剧《春至》圆满落幕',\n",
       " '考研座谈暨2020年考研交流答疑会圆满结束',\n",
       " '教师招聘启事',\n",
       " '创意无限，未来可期——',\n",
       " '青马工程第四讲暨闭营仪式顺利举行',\n",
       " '垃圾分类我先行——',\n",
       " '“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行',\n",
       " '以梦为马，不负韶华——',\n",
       " '2019级新生开学典礼圆满结束',\n",
       " '学子在全国高校数字艺术设计大赛中斩获大奖',\n",
       " '学子在第七届中国大学生公共关系策划大赛中喜获佳绩',\n",
       " '倾心之作，致敬经典——',\n",
       " '紫阳戏剧社《倾城之恋》话剧展演圆满落幕',\n",
       " '毕业季 | 今朝有离别，青春不散场 ——',\n",
       " '2019届毕业生毕业季系列活动有序开展']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-A-2 使用 文本\n",
    "解析后.xpath('//div[@class=\"news_title\"]/a/text()')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# B-A-3 該你了\n",
    "# 你是否能解釋為什麼 B1 和 B2 結果不一樣 ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 数据科学家使用xpath的角度\n",
    "![](https://qxf2.com/blog/wp-content/uploads/2015/12/Table.png)\n",
    "* 不挑greedy的策略\n",
    "    * 求全求不漏\n",
    "    * 可能有垃圾\n",
    "    * 使用 // （后代descendant） 而不用  / 子女（children）\n",
    "    * 使用 **任意**元素/标签  而不用  **指定**元素/标签\n",
    "* 挑剔ungreedy的策略\n",
    "    * 求准求数据整齐\n",
    "    * 可能有漏数据\n",
    "    * 使用  / 子女（children） 而不用  // （后代descendant）\n",
    "    * 使用 **指定**元素/标签  而不用  **任意**元素/标签 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 不挑greedy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element 'a' href='/index.php' target='_blank' title='中山大学南方学院'>,\n",
       " <Element 'a' href='http://oa.nfu.edu.cn/' target='_blank'>,\n",
       " <Element 'a' href='http://en.nfu.edu.cn/'>,\n",
       " <Element 'a' href='javascript:;' id='search_btn'>,\n",
       " <Element 'a' href='/index.php'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/29.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/29.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/30.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/135.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/34.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/104.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/2.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/61.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/61.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/36.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/165.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/31.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/31.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/163.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/164.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/106.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/106.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/127.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/107.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/49.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/49.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/129.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/50.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/79.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/79.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/80.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/159.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/159.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/161.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/44.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/44.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/45.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/32.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/32.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/105.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/87.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/51.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/82.html' target='_self'>,\n",
       " <Element 'a' href='/index.php'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6363.html' target='_blank' title='文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6366.html' target='_blank' title='文学与传媒学院2019年学术研讨会暨总结大会顺利召开'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6318.html' target='_blank' title='展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6154.html' target='_blank' title='文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5348.html' target='_blank' title='文学与传媒学院教师招聘启事'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6016.html' target='_blank' title='创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6019.html' target='_blank' title='垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5794.html' target='_blank' title='以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5776.html' target='_blank' title='文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5777.html' target='_blank' title='文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5642.html' target='_blank' title='倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5647.html' target='_blank' title='毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展'>,\n",
       " <Element 'a' class=('num',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/2.html'>,\n",
       " <Element 'a' class=('num',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/3.html'>,\n",
       " <Element 'a' class=('num',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/4.html'>,\n",
       " <Element 'a' class=('num',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/5.html'>,\n",
       " <Element 'a' class=('num',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/6.html'>,\n",
       " <Element 'a' class=('num',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/7.html'>,\n",
       " <Element 'a' class=('next',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/2.html'>,\n",
       " <Element 'a' href='http://www.moe.gov.cn/' target='_blank' title='教育部'>,\n",
       " <Element 'a' href='http://www.gz.gov.cn/' target='_blank' title='广州市政府'>,\n",
       " <Element 'a' href='http://www.cnki.net/' target='_blank' title='中国知网'>,\n",
       " <Element 'a' href='http://edu.gd.gov.cn' target='_blank' title='广东省教育厅'>,\n",
       " <Element 'a' href='http://www.gdpr.com/' target='_blank' title='珠江投资'>,\n",
       " <Element 'a' href='http://journal.nfu.edu.cn/CN/volumn/home.shtml' target='_blank' title='南方论丛'>,\n",
       " <Element 'a' href='http://www.sysu.edu.cn/' target='_blank' title='中山大学 '>,\n",
       " <Element 'a' href='http://www.nfu.edu.cn/index.php/home/article/index/cid/136.html' target='_blank' title='珠江教育联盟'>,\n",
       " <Element 'a' href='/index.php/home/article/link.html'>,\n",
       " <Element 'a' href='http://www.unsun.net' target='_blank'>,\n",
       " <Element 'a' href='http://www.beian.miit.gov.cn' target='_blank'>,\n",
       " <Element 'a' href='/index.php/admin/index/login.html' target='_blank'>,\n",
       " <Element 'a' href='http://old.nfu.edu.cn/' target='_blank'>,\n",
       " <Element 'a' href='http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=44011702000081' style='display:inline-block;text-decoration:none;height:20px;line-height:20px;' target='_blank'>]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-B-1 使用 xpath \n",
    "r.html.xpath('//a')  # greedy 不挑 所有 <a> 元素标签"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['中山大学南方学院',\n",
       " '文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会',\n",
       " '文学与传媒学院2019年学术研讨会暨总结大会顺利召开',\n",
       " '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕',\n",
       " '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束',\n",
       " '文学与传媒学院教师招聘启事',\n",
       " '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行',\n",
       " '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行',\n",
       " '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束',\n",
       " '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖',\n",
       " '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩',\n",
       " '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕',\n",
       " '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展',\n",
       " '教育部',\n",
       " '广州市政府',\n",
       " '中国知网',\n",
       " '广东省教育厅',\n",
       " '珠江投资',\n",
       " '南方论丛',\n",
       " '中山大学 ',\n",
       " '珠江教育联盟']"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-B-2 使用 xpath 限定 取特定属性\n",
    "# 注意和 B1 的内容相比, 是不是少了一些? \n",
    "# 没有特定属性title就不会被选到\n",
    "解析后.xpath('//a/@title')  # less greedy 有点挑 <a> 元素标签"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['中山大学南方学院',\n",
       " '文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会',\n",
       " '文学与传媒学院2019年学术研讨会暨总结大会顺利召开',\n",
       " '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕',\n",
       " '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束',\n",
       " '文学与传媒学院教师招聘启事',\n",
       " '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行',\n",
       " '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行',\n",
       " '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束',\n",
       " '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖',\n",
       " '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩',\n",
       " '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕',\n",
       " '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展',\n",
       " '教育部',\n",
       " '广州市政府',\n",
       " '中国知网',\n",
       " '广东省教育厅',\n",
       " '珠江投资',\n",
       " '南方论丛',\n",
       " '中山大学 ',\n",
       " '珠江教育联盟']"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-B-3 使用 xpath  \n",
    "# greedy 不挑元素/标签  只挑有特定属性title的所有元素\n",
    "r.html.xpath('//*/@title')  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 挑剔ungreedy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会',\n",
       " '文学与传媒学院2019年学术研讨会暨总结大会顺利召开',\n",
       " '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕',\n",
       " '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束',\n",
       " '文学与传媒学院教师招聘启事',\n",
       " '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行',\n",
       " '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行',\n",
       " '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束',\n",
       " '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖',\n",
       " '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩',\n",
       " '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕',\n",
       " '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展']"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-B-4 使用 xpath  # ungreedy 更精準\n",
    "r.html.xpath('//div[@class=\"news_title\"]/a/@title')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element 'a' href='/index.php/home/article/search_detail/id/6363.html' target='_blank' title='文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6366.html' target='_blank' title='文学与传媒学院2019年学术研讨会暨总结大会顺利召开'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6318.html' target='_blank' title='展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6154.html' target='_blank' title='文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5348.html' target='_blank' title='文学与传媒学院教师招聘启事'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6016.html' target='_blank' title='创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6019.html' target='_blank' title='垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5794.html' target='_blank' title='以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5776.html' target='_blank' title='文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5777.html' target='_blank' title='文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5642.html' target='_blank' title='倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5647.html' target='_blank' title='毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展'>]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-B-5 使用 xpath  # ungreedy 最精准\n",
    "#   xpath 完全没有用 // 也没有用 * \n",
    "r.html.xpath('body/div[@class=\"lin-content\"]/div[@class=\"lin-neiye clearfix\"]/div[@class=\"search_list_right\"]/div[@class=\"ny_content\"]/ul[@class=\"list-ul\"]/li/div[@class=\"news_title\"]/a')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Xpath Axis](http://krum.rz.uni-mannheim.de/inet-2005/images/xpath-axis.gif)\n",
    "## 更多xpath\n",
    "xpath是一门在XML文档（包括html，以樹狀為主的純本文結構文檔）中查找信息的语言\n",
    "\n",
    "2. 熟悉 [xpath 语法](https://www.w3cschool.cn/xpath/xpath-syntax.html)丶[xpath 节点](https://www.w3cschool.cn/xpath/xpath-nodes.html)\n",
    "    * 节点\n",
    "        * 元素丶属性丶文本丶命名空间丶文档（根）结点\n",
    "    * 节点关系 \n",
    "        * 父母（parent） vs.先辈（ancestor）\n",
    "        * 子女（children） vs. 后代（descendant）\n",
    "        * 同胞（sibling）\n",
    "3. 使用 [xpath cheatsheet](https://devhints.io/xpath)\n",
    "  * 在 Chrome Inspector 使用\n",
    "  * 在 requests-html (Python) 使用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会', '文学与传媒学院2019年学术研讨会暨总结大会顺利召开', '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕', '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束', '文学与传媒学院教师招聘启事', '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行', '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行', '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束', '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖', '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩', '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕', '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展']\n"
     ]
    }
   ],
   "source": [
    "# B-C-1 \n",
    "print (r.html.xpath('//div[@class=\"news_title\"]/a/@title'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['/index.php/home/article/search_detail/id/6363.html', '/index.php/home/article/search_detail/id/6366.html', '/index.php/home/article/search_detail/id/6318.html', '/index.php/home/article/search_detail/id/6154.html', '/index.php/home/article/search_detail/id/5348.html', '/index.php/home/article/search_detail/id/6016.html', '/index.php/home/article/search_detail/id/6019.html', '/index.php/home/article/search_detail/id/5794.html', '/index.php/home/article/search_detail/id/5776.html', '/index.php/home/article/search_detail/id/5777.html', '/index.php/home/article/search_detail/id/5642.html', '/index.php/home/article/search_detail/id/5647.html']\n"
     ]
    }
   ],
   "source": [
    "# B-C-2\n",
    "print (r.html.xpath('//div[@class=\"news_title\"]/a/@href'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['2020-01-06', '2020-01-06', '2019-12-20', '2019-11-22', '2019-11-05', '2019-11-04', '2019-11-04', '2019-09-16', '2019-09-09', '2019-09-09', '2019-06-24', '2019-06-24']\n"
     ]
    }
   ],
   "source": [
    "# B-C-3\n",
    "print (r.html.xpath('//font[@class=\"right-more\"]/text()'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['2020-01-06', '2020-01-06', '2019-12-20', '2019-11-22', '2019-11-05', '2019-11-04', '2019-11-04', '2019-09-16', '2019-09-09', '2019-09-09', '2019-06-24', '2019-06-24']\n"
     ]
    }
   ],
   "source": [
    "# B-C-4\n",
    "print (r.html.xpath('//div[@class=\"news_title\"]/preceding-sibling::font/text()'))\n",
    "\n",
    "## 廖老师主张这个 B-C-4代码，会比B-C-3更好，你能不能从B-C-1, 及 B-C-2观察xpath语法，猜猜为什麽"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 使用pandas 輸出xlsx\n",
    "4. 简易使用 [pd.DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'r' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-5-e0c30e556443>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[0;32m      2\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mpandas\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mpd\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      3\u001b[0m df = pd.DataFrame( {\n\u001b[1;32m----> 4\u001b[1;33m          \u001b[1;34m\"标题\"\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mr\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mhtml\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mxpath\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'//div[@class=\"news_title\"]/a/@title'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m      5\u001b[0m          \u001b[1;34m\"链结\"\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mr\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mhtml\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mxpath\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'//div[@class=\"news_title\"]/a/@href'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      6\u001b[0m          \u001b[1;34m\"日期\"\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mr\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mhtml\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mxpath\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'//font[@class=\"right-more\"]/text()'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mNameError\u001b[0m: name 'r' is not defined"
     ]
    }
   ],
   "source": [
    "# B-D-1 pd.DataFrame 建构，pandas课有教\n",
    "import pandas as pd\n",
    "df = pd.DataFrame( {\n",
    "         \"标题\": r.html.xpath('//div[@class=\"news_title\"]/a/@title'),\n",
    "         \"链结\": r.html.xpath('//div[@class=\"news_title\"]/a/@href'),\n",
    "         \"日期\": r.html.xpath('//font[@class=\"right-more\"]/text()'),\n",
    "     } )\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "# B-D-2 pd.DataFrame 输出excel，pandas课有教\n",
    "df.to_excel(\"20春_Web数据挖掘_week02_nfu_文学与传媒学院.xlsx\", sheet_name=\"搜查结果\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 本周小结内容\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 打开Excel档看成果"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 课后练习及下周项目m.liepin.com\n",
    "\n",
    "使用 xpath 应用 [m.liepin.com](https://m.liepin.com/zhaopin/)\n",
    "\n",
    "你是数据科学家，这m.liepin.com有什麽样的牛肉，你打算要怎麽抓？\n",
    "* 工作名称\n",
    "* 工作地点\n",
    "* 工作$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'HTMLSession' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-4-0ef6a321e7ef>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[0;32m      2\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mpandas\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mpd\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      3\u001b[0m \u001b[0murl\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m\"https://m.liepin.com/zhaopin/?keyword=pandas\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 4\u001b[1;33m \u001b[0msession\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mHTMLSession\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m      5\u001b[0m \u001b[0mr\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0msession\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget\u001b[0m\u001b[1;33m(\u001b[0m \u001b[0murl\u001b[0m \u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mNameError\u001b[0m: name 'HTMLSession' is not defined"
     ]
    }
   ],
   "source": [
    "# C-1   单一页面\n",
    "import pandas as pd\n",
    "url = \"https://m.liepin.com/zhaopin/?keyword=pandas\"\n",
    "session = HTMLSession()\n",
    "r = session.get( url )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# C-2 保存备用\n",
    "with open (\"20春_Web数据挖掘_week02_zhaopin_pandas.html\", encoding = \"utf8\", mode = \"w\") as fp:\n",
    "    fp.write(r.html.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# C-3\n",
    "# 易: '职称', '链结', '薪水'\n",
    "\n",
    "# 你的代码?\n",
    "\n",
    "数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# C-4\n",
    "# 中: '公司地点', '公司名称'\n",
    "\n",
    "# 你的代码?\n",
    "\n",
    "数据 = pd.DataFrame(数据字典)\n",
    "数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# C-5\n",
    "# 难: '公司URL', '时间', '经验'\n",
    "\n",
    "# 你的代码?\n",
    "\n",
    "数据.to_excel(\"20春_Web数据挖掘_week02_liepin.xlsx\", sheet_name=\"搜查结果\")\n",
    "数据 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "749px",
    "left": "1125.609375px",
    "top": "110px",
    "width": "281.390625px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
